After centering and re-scaling the data, I ran a for loop to generate an elbow plot so I could determine which k I should use to define my clusters.
I tried several k values and settled on 7 after comparing how well the clusters could distinguish red from white wine.
After trying out various parameters for preliminary visualization, I found that the data points separated most clearly along the total.sulfur.dioxide and free.sulfur.dioxide features. These properties will later be shown to be the most crucial aspects in the principal component definitions.
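The elbow procedure described above can be sketched as follows. This is a minimal, standard-library Python illustration rather than the code used for this report: the `toy` points, the helper names, and the range of k are all assumptions made for the example. It standardizes the columns, runs a plain Lloyd's-algorithm k-means for each k inside a for loop, and records the total within-cluster sum of squares that an elbow plot would display.

```python
import random

def standardize(rows):
    """Center each column at 0 and scale it to unit sample standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [(sum((x - m) ** 2 for x in c) / (len(c) - 1)) ** 0.5
           for c, m in zip(cols, means)]
    return [tuple((x - m) / s for x, m, s in zip(r, means, sds)) for r in rows]

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_wss(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns the total within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct observations
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

# Hypothetical two-feature data with three loose groups (not the wine data).
toy = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (1.1, 1.3),
       (5.0, 5.2), (5.3, 4.9), (4.8, 5.1), (5.1, 4.7),
       (9.0, 1.0), (9.2, 1.3), (8.9, 0.8), (9.1, 1.1)]
scaled = standardize(toy)

wss_by_k = {}
for k in range(1, 7):
    wss_by_k[k] = kmeans_wss(scaled, k)
    print(k, round(wss_by_k[k], 3))
```

Plotting `wss_by_k` against k gives the elbow plot. A single random start is used here for brevity; several starts per k would make the curve more stable.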
The PCA for this data was able to explain the vast majority of the information with a single dimension. I elected to keep 2 components, both because there are two distinct types of wine in the data set and for visualization purposes.
When graphing the principal components against one another, there is a clear separation between the two colors of wine. Specifically, reds appear to have low magnitudes of PC1 and PC2 and whites appear to have high magnitudes of PC1 and PC2.
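As a rough illustration of what "explaining most of the information with a single dimension" means, the share of variance carried by each principal component is its eigenvalue divided by the sum of all eigenvalues of the covariance matrix. The Python sketch below works this out in closed form for just two features. The function name and the sample values are hypothetical, and this is not the code behind the report, which would decompose the full feature set.

```python
import math

def pca_2d_variance_ratio(xs, ys):
    """Share of total variance on PC1 and PC2 for two features.

    Uses the closed-form eigenvalues of the 2x2 sample covariance matrix.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    half_trace = (sxx + syy) / 2
    half_gap = math.sqrt((sxx - syy) ** 2 + 4 * sxy ** 2) / 2
    eig1, eig2 = half_trace + half_gap, half_trace - half_gap
    total = eig1 + eig2
    return eig1 / total, eig2 / total

# Hypothetical values standing in for two sulfur measurements.
total_so2 = [170, 132, 97, 186, 34, 67, 54, 40]
free_so2 = [45, 14, 30, 47, 11, 25, 15, 17]
pc1_share, pc2_share = pca_2d_variance_ratio(total_so2, free_so2)
print(round(pc1_share, 3), round(pc2_share, 3))
```

When one feature has a much larger variance than the other, PC1 aligns closely with it and takes nearly all of the variance.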
The quality ratings were more difficult to visualize in relation to the clusters because of the number of values the feature takes. What is noticeable is that quality does not appear to be associated with either of the principal components, but is rather spread throughout the data in the 5-6 range. We will return to this fact in the final table.
As stated previously, the dominant force in the principal components was the sulfur content. In PC1, total.sulfur.dioxide stood out as the classifying feature, ahead of free.sulfur.dioxide by several orders of magnitude.
PC2 is defined by free.sulfur.dioxide as well as a lack of total.sulfur.dioxide, the main feature of PC1.
| clusterID_wine | color | Count | PC1 | PC2 | Average_Quality |
|---|---|---|---|---|---|
| 1 | red | 20 | -17.753428 | 3.5167563 | 5.200000 |
| 1 | white | 980 | 33.226994 | 0.1943185 | 5.937755 |
| 2 | red | 2 | 14.245392 | 14.4058073 | 6.000000 |
| 2 | white | 1468 | 59.104686 | 3.4238866 | 5.621935 |
| 3 | red | 624 | -75.765873 | 1.2820173 | 5.908654 |
| 3 | white | 15 | 12.442049 | -10.8035473 | 4.933333 |
| 4 | red | 9 | 35.802692 | -14.3143548 | 5.888889 |
| 4 | white | 1215 | 2.861806 | -5.7205112 | 5.677366 |
| 5 | red | 885 | -70.367496 | 1.7985294 | 5.428249 |
| 5 | white | 51 | -19.580171 | -8.5305425 | 4.725490 |
| 6 | red | 32 | -58.731518 | 3.1655819 | 6.500000 |
| 6 | white | 1162 | -7.923514 | -0.0300585 | 6.432014 |
| 7 | red | 27 | -65.647080 | -0.9842082 | 5.333333 |
| 7 | white | 6 | 70.537750 | -11.8262665 | 4.666667 |
This table presents the noise in the data in a much more manageable way. What becomes immediately apparent is the clusters’ ability to determine a wine’s color. Cluster 2 is overwhelmingly white (1468 of 1470 wines), while cluster 5 is decidedly red (885 of 936). It’s also clear that red wines tend to have large negative or small positive values for PC1, while white wines tend to have large positive or small negative values for PC1, indicating that the main difference between red and white wine is the amount of total.sulfur.dioxide present.
The average quality for each color/cluster demonstrates a disconnect between the chemical contents of the wine and its subjective score. While there is some variance between the clusters, the range between the highest and lowest rated color/cluster does not exceed 2 points of quality. These results could indicate that wine quality is not determined by chemical content. It could also mean that while vast differences in quality are easily identifiable, what separates a wine with a score of 8 from one with a score of 9 is negligible enough to become obscured over successive tastings.
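How cleanly each cluster separates the colors can be read straight off the Count column. The short Python sketch below copies the counts from the table and reports each cluster's majority color and its share. The variable and function names are illustrative, and this is a check on the published numbers rather than part of the original analysis.

```python
# Counts copied from the cluster/color table above.
counts = {
    1: {"red": 20, "white": 980},
    2: {"red": 2, "white": 1468},
    3: {"red": 624, "white": 15},
    4: {"red": 9, "white": 1215},
    5: {"red": 885, "white": 51},
    6: {"red": 32, "white": 1162},
    7: {"red": 27, "white": 6},
}

def majority(color_counts):
    """Return the dominant color in a cluster and its share of the cluster."""
    color = max(color_counts, key=color_counts.get)
    return color, color_counts[color] / sum(color_counts.values())

summary = {cid: majority(c) for cid, c in counts.items()}
for cid, (color, share) in summary.items():
    print(f"cluster {cid}: mostly {color} ({share:.1%})")

# Overall purity: fraction of wines that match their cluster's majority color.
n_total = sum(sum(c.values()) for c in counts.values())
overall_purity = sum(max(c.values()) for c in counts.values()) / n_total
print(f"overall purity: {overall_purity:.1%}")
```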
To answer this question, we performed some preliminary data cleaning and exploration, and then employed both PCA and K-means clustering to understand how the market was laid out. After extensively comparing the results of clustering and PCA, we came to the conclusion that PCA was identifying similar or identical trends to clustering, but was much harder to interpret and display graphically. So we chose to visualize our data by plotting the centroids of the clusters we identified along axes that were as meaningful as we could make them, to display the patterns in market segmentation. We chose 10 clusters for our analysis. Preliminary investigation (elbow plot) showed that there might be as few as 5 or 6 natural clusters, but we found that around 10 created a few very meaningful and highly distinct clusters that we wanted to single out for presentation. We did analyses with both scaled and unscaled variables, and have chosen to present the unscaled ones, as the axes are easily read as “Mean number of tweets for users in a cluster” and all observations are already on the same scale in the original data (number of tweets).
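On unscaled data, a cluster centroid is simply the mean number of tweets per category among the users assigned to that cluster, which is why the axes stay directly interpretable. The Python sketch below illustrates this under stated assumptions: the tweet counts, cluster labels, and all column names other than `current_events` are hypothetical, and this is not the code used for the analysis.

```python
from collections import defaultdict

def cluster_centroids(rows, labels):
    """Mean of every column within each cluster label."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {label: [sum(col) / len(col) for col in zip(*members)]
            for label, members in groups.items()}

# Illustrative column names and tweet counts (one row per user).
categories = ["current_events", "health", "gaming"]
tweets = [
    [1, 9, 0],
    [2, 11, 1],
    [1, 0, 8],
    [3, 1, 10],
    [2, 1, 0],
]
labels = [0, 0, 1, 1, 2]  # hypothetical cluster assignment per user

centroids = cluster_centroids(tweets, labels)
for label, center in sorted(centroids.items()):
    print(label, dict(zip(categories, center)))
```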
This cluster, number , the “Chatters” cluster, is both a distinct category and also a useful baseline for outlining some general trends. First, members of this cluster primarily use Twitter as a social platform to chat, share photos, and shop, with no particular topic or area of interest. We can also note here that there is very little spread on the leftmost graph of current_events, indicating that most Twitter users who follow ‘NutrientH20’ use Twitter to talk about what’s going on in the world at least some of the time.
The second meaningful cluster we found, cluster , is our “Health and Fitness” cluster. This cluster primarily uses Twitter to talk about fitness- and health-related activities and little else. There is also a second cluster of users that has similar Twitter behavior, just at a lower volume.
The third cluster we found, cluster , is our “College Gamers” cluster. Members of this cluster primarily tweet about online gaming and their college/university, so we can assume they are primarily young college students. They also tweet about playing sports a little more than average.
Our fourth cluster, cluster , is our “Sports Dads” cluster. Members of this cluster primarily tweet about sports fandom and religion, and disproportionately more about parenting as well.
Our analysis indicates that there is an “Average” Twitter user, who tweets a little about a wide variety of topics, but people who produce high volumes of tweets tend to tweet about very specific areas of interest. We found a number of other interesting clusters, such as a “News & Politics” cluster, but in an effort to keep the analysis relatively short, we chose these 4.
After pre-processing the groceries data, I fed the list of items and carts to the apriori function in arules. After a bit of trial and error, I decided to use the following parameters: support = 0.0012, confidence = 0.8, and maxlen = 10. I lowered the support threshold because a high support requirement was not returning interesting associations. I raised the confidence threshold in order to filter out weaker associations, and I set the maximum length to 10 because I did not want length to be a binding constraint: I wanted to see all rules that meet my support and confidence thresholds, regardless of length.
The following table lists the 10 rules with the highest lift values. Recall that lift measures how much the probability of observing the RHS increases when we condition on the LHS. The rule with the highest lift associates buying liquor and red/blush wine with buying bottled beer, which makes a lot of sense.
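To make the role of the three thresholds concrete, here is a brute-force Python sketch of rule mining on a handful of made-up carts. It is not the Apriori algorithm and not the arules API: the function `mine_rules`, its argument names, and the toy carts are assumptions for illustration only. It enumerates every itemset of up to `maxlen` items, keeps those meeting the support threshold, and emits single-item-RHS rules meeting the confidence threshold.

```python
from itertools import combinations

def mine_rules(carts, min_support, min_confidence, maxlen):
    """Enumerate rules LHS -> {rhs} that pass the support and confidence cuts.

    maxlen bounds the total number of items in a rule (LHS plus RHS).
    """
    n = len(carts)
    items = sorted(set().union(*carts))

    def support(itemset):
        # Fraction of carts that contain every item in the itemset.
        return sum(itemset <= cart for cart in carts) / n

    rules = []
    for size in range(2, maxlen + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s = support(itemset)
            if s == 0 or s < min_support:
                continue
            for rhs in combo:
                lhs = itemset - {rhs}
                confidence = s / support(lhs)
                if confidence >= min_confidence:
                    rules.append((tuple(sorted(lhs)), rhs, s, confidence))
    return rules

# Hypothetical carts, far smaller than the real groceries data.
carts = [
    {"whole milk", "rice", "sugar"},
    {"whole milk", "rice", "sugar"},
    {"whole milk", "rice"},
    {"whole milk", "sugar"},
    {"soda"},
]
rules = mine_rules(carts, min_support=0.4, min_confidence=0.8, maxlen=3)
for lhs, rhs, s, c in rules:
    print(lhs, "->", rhs, f"support={s:.2f}", f"confidence={c:.2f}")
```

Lowering `min_support` admits rarer itemsets, raising `min_confidence` prunes weaker rules, and a generous `maxlen` simply stops length from excluding anything that passes the other two cuts.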
| LHS | RHS | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|
| {liquor,red/blush wine} | {bottled beer} | 0.0019319 | 0.9047619 | 0.0021352 | 11.235269 | 19 |
| {other vegetables,rice,whole milk,yogurt} | {root vegetables} | 0.0013218 | 0.8666667 | 0.0015252 | 7.951182 | 13 |
| {oil,other vegetables,tropical fruit,whole milk} | {root vegetables} | 0.0013218 | 0.8666667 | 0.0015252 | 7.951182 | 13 |
| {pip fruit,sausage,sliced cheese} | {yogurt} | 0.0012201 | 0.8571429 | 0.0014235 | 6.144315 | 12 |
| {butter,curd,tropical fruit,whole milk} | {yogurt} | 0.0012201 | 0.8571429 | 0.0014235 | 6.144315 | 12 |
| {butter milk,other vegetables,pastry} | {yogurt} | 0.0012201 | 0.8000000 | 0.0015252 | 5.734694 | 12 |
| {fruit/vegetable juice,pastry,whipped/sour cream} | {yogurt} | 0.0012201 | 0.8000000 | 0.0015252 | 5.734694 | 12 |
| {citrus fruit,root vegetables,tropical fruit,whipped/sour cream} | {other vegetables} | 0.0012201 | 1.0000000 | 0.0012201 | 5.168156 | 12 |
| {oil,root vegetables,whole milk,yogurt} | {other vegetables} | 0.0014235 | 0.9333333 | 0.0015252 | 4.823612 | 14 |
| {citrus fruit,root vegetables,tropical fruit,whole milk,yogurt} | {other vegetables} | 0.0014235 | 0.9333333 | 0.0015252 | 4.823612 | 14 |
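The columns in the table above are related by simple ratios: support is the share of carts containing LHS and RHS together, coverage is the share containing the LHS, confidence is support divided by coverage, and lift is confidence divided by the support of the RHS. A small Python sketch on made-up carts (the function name and the toy data are assumptions, not output from the report) shows the arithmetic:

```python
def rule_metrics(carts, lhs, rhs):
    """Support, coverage, confidence, lift, and count for the rule lhs -> rhs."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    n = len(carts)
    n_lhs = sum(lhs <= cart for cart in carts)            # carts with the LHS
    n_rhs = sum(rhs <= cart for cart in carts)            # carts with the RHS
    n_both = sum((lhs | rhs) <= cart for cart in carts)   # carts with both
    confidence = n_both / n_lhs
    return {
        "support": n_both / n,
        "coverage": n_lhs / n,
        "confidence": confidence,
        "lift": confidence / (n_rhs / n),
        "count": n_both,
    }

# Hypothetical carts for illustration only.
toy_carts = [
    {"liquor", "red/blush wine", "bottled beer"},
    {"liquor", "red/blush wine", "bottled beer"},
    {"liquor", "red/blush wine"},
    {"bottled beer", "soda"},
    {"whole milk", "yogurt"},
    {"whole milk", "rice", "sugar"},
    {"whole milk"},
    {"soda"},
    {"yogurt", "whole milk"},
    {"rolls/buns"},
]
metrics = rule_metrics(toy_carts, {"liquor", "red/blush wine"}, {"bottled beer"})
print(metrics)
```

A lift above 1 means the RHS shows up more often in carts containing the LHS than it does across all carts.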
The next table lists the 10 rules with the highest support values and arranges them by confidence. Recall that confidence is the ratio between the support of LHS and RHS together and the support of the LHS alone. Not many of these rules are interesting because most of them simply associate popular items such as whole milk and other vegetables with other basic grocery bundles. In fact, most of the rules found here are not interesting for the same reason: they are simply associating commonly purchased grocery bundles.
| LHS | RHS | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|
| {rice,sugar} | {whole milk} | 0.0012201 | 1.0000000 | 0.0012201 | 3.913649 | 12 |
| {flour,root vegetables,whipped/sour cream} | {whole milk} | 0.0017285 | 1.0000000 | 0.0017285 | 3.913649 | 17 |
| {oil,other vegetables,root vegetables,yogurt} | {whole milk} | 0.0014235 | 1.0000000 | 0.0014235 | 3.913649 | 14 |
| {butter,domestic eggs,other vegetables,whipped/sour cream} | {whole milk} | 0.0012201 | 1.0000000 | 0.0012201 | 3.913649 | 12 |
| {citrus fruit,root vegetables,tropical fruit,whipped/sour cream} | {other vegetables} | 0.0012201 | 1.0000000 | 0.0012201 | 5.168156 | 12 |
| {cream cheese ,other vegetables,sugar} | {whole milk} | 0.0015252 | 0.9375000 | 0.0016268 | 3.669046 | 15 |
| {root vegetables,sausage,tropical fruit,yogurt} | {whole milk} | 0.0015252 | 0.9375000 | 0.0016268 | 3.669046 | 15 |
| {citrus fruit,domestic eggs,sugar} | {whole milk} | 0.0014235 | 0.9333333 | 0.0015252 | 3.652739 | 14 |
| {domestic eggs,sugar,yogurt} | {whole milk} | 0.0014235 | 0.9333333 | 0.0015252 | 3.652739 | 14 |
| {oil,root vegetables,whole milk,yogurt} | {other vegetables} | 0.0014235 | 0.9333333 | 0.0015252 | 4.823612 | 14 |
Any interesting associations found in this data are obscured by a select few popular items being purchased in the majority of carts. This is clear from the following network graph of these association rules.
The Basic Set of Rules
I want to investigate those carts that do not include these popular items. To do this, I create a tag called no milk which I apply to those carts that do not contain whole milk. This tag will bring to the surface associations that were otherwise obscured by the strong association between whole milk and most carts. The following network graphs show the results.
Associations between whole milk and basic bundles now exist alongside bundles found in carts with no milk
The “no milk” tag forms a network of grocery carts distinct from the carts that contain “whole milk”
Rules associated with “whole milk” generally have higher lift than those associated with the “no milk” tag
The “no milk” tag picks out more snack and alcohol items and fewer traditional grocery items. Items associated with “whole milk” are typical grocery bundles
The following tables clearly show the same trend that the above graphs illustrate: items associated with the “no milk” tag tend to be snacks, alcohol, and other items that a non-grocery shopper might run into the store and buy, while “whole milk” is associated with basic, everyday grocery staples.
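The tagging step described above amounts to adding a pseudo-item to every cart that lacks whole milk, so that the rule miner can place it on the right-hand side of a rule. A minimal Python sketch, assuming carts are represented as sets of item names (the function name and the example carts are hypothetical, not taken from the report's code):

```python
def tag_no_milk(carts, item="whole milk", tag="no milk"):
    """Return copies of the carts, adding `tag` to each cart without `item`."""
    tagged = []
    for cart in carts:
        new_cart = set(cart)
        if item not in new_cart:
            new_cart.add(tag)
        tagged.append(new_cart)
    return tagged

# Hypothetical carts for illustration.
example_carts = [{"whole milk", "rice"}, {"soda", "bottled beer"}, {"yogurt"}]
tagged_carts = tag_no_milk(example_carts)
print(tagged_carts)
```

Because every cart now carries either “whole milk” or “no milk”, the two items split the carts into complementary groups.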
| LHS | RHS | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|
| {liquor,red/blush wine} | {no milk} | 0.0021352 | 1.0000000 | 0.0021352 | 1.343212 | 21 |
| {bottled beer,liquor} | {no milk} | 0.0046772 | 1.0000000 | 0.0046772 | 1.343212 | 46 |
| {cream cheese ,UHT-milk} | {no milk} | 0.0028470 | 1.0000000 | 0.0028470 | 1.343212 | 28 |
| {bottled beer,liquor,red/blush wine} | {no milk} | 0.0019319 | 1.0000000 | 0.0019319 | 1.343212 | 19 |
| {bottled beer,liquor,soda} | {no milk} | 0.0012201 | 1.0000000 | 0.0012201 | 1.343212 | 12 |
| {rolls/buns,soda,spread cheese} | {no milk} | 0.0013218 | 1.0000000 | 0.0013218 | 1.343212 | 13 |
| {misc. beverages,rolls/buns,tropical fruit} | {no milk} | 0.0012201 | 1.0000000 | 0.0012201 | 1.343212 | 12 |
| {bottled water,shopping bags,UHT-milk} | {no milk} | 0.0016268 | 1.0000000 | 0.0016268 | 1.343212 | 16 |
| {shopping bags,soda,UHT-milk} | {no milk} | 0.0012201 | 1.0000000 | 0.0012201 | 1.343212 | 12 |
| {canned beer,sausage,shopping bags} | {no milk} | 0.0025419 | 0.9615385 | 0.0026436 | 1.291550 | 25 |
| LHS | RHS | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|
| {rice,sugar} | {whole milk} | 0.0012201 | 1.0000000 | 0.0012201 | 3.913649 | 12 |
| {flour,root vegetables,whipped/sour cream} | {whole milk} | 0.0017285 | 1.0000000 | 0.0017285 | 3.913649 | 17 |
| {oil,other vegetables,root vegetables,yogurt} | {whole milk} | 0.0014235 | 1.0000000 | 0.0014235 | 3.913649 | 14 |
| {butter,domestic eggs,other vegetables,whipped/sour cream} | {whole milk} | 0.0012201 | 1.0000000 | 0.0012201 | 3.913649 | 12 |
| {cream cheese ,other vegetables,sugar} | {whole milk} | 0.0015252 | 0.9375000 | 0.0016268 | 3.669046 | 15 |
| {root vegetables,sausage,tropical fruit,yogurt} | {whole milk} | 0.0015252 | 0.9375000 | 0.0016268 | 3.669046 | 15 |
| {citrus fruit,domestic eggs,sugar} | {whole milk} | 0.0014235 | 0.9333333 | 0.0015252 | 3.652739 | 14 |
| {domestic eggs,sugar,yogurt} | {whole milk} | 0.0014235 | 0.9333333 | 0.0015252 | 3.652739 | 14 |
| {cream cheese ,pip fruit,whipped/sour cream} | {whole milk} | 0.0013218 | 0.9285714 | 0.0014235 | 3.634103 | 13 |
| {other vegetables,rice,root vegetables,yogurt} | {whole milk} | 0.0013218 | 0.9285714 | 0.0014235 | 3.634103 | 13 |
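One way to read the lift columns in the two tables above: since lift is confidence divided by the support of the RHS, each row implies a value for that support. The Python check below uses only numbers copied from the tables (the variable names are illustrative) and recovers the share of carts with and without whole milk.

```python
# (RHS, confidence, lift) copied from rows of the two tables above.
rows = [
    ("no milk", 1.0000000, 1.343212),
    ("no milk", 0.9615385, 1.291550),
    ("whole milk", 1.0000000, 3.913649),
    ("whole milk", 0.9375000, 3.669046),
]

# support(RHS) = confidence / lift, so every row gives an estimate of it.
implied = {}
for rhs, confidence, lift in rows:
    implied.setdefault(rhs, []).append(confidence / lift)

p_no_milk = sum(implied["no milk"]) / len(implied["no milk"])
p_milk = sum(implied["whole milk"]) / len(implied["whole milk"])
print(round(p_no_milk, 4), round(p_milk, 4), round(p_no_milk + p_milk, 4))

# Confidence cannot exceed 1, so lift is capped at 1 / support(RHS).
max_lift_no_milk = 1 / p_no_milk
max_lift_milk = 1 / p_milk
print(round(max_lift_no_milk, 3), round(max_lift_milk, 3))
```

The two implied supports sum to about 1, as they should for complementary tags. Because roughly three quarters of carts contain no whole milk, a “no milk” rule cannot reach a lift much above 1.34 even at confidence 1, so the lower lift of the “no milk” rules is partly a consequence of how common the tag is.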